Abstract

In this paper we present first results from a comparative study. Its
aim is to test the feasibility of different inductive learning
techniques to perform the automatic acquisition of linguistic knowledge
within a natural language database interface. In our interface
architecture the machine learning module replaces an elaborate semantic
analysis component. The learning module learns the correct mapping of a
user's input to the corresponding database command based on a
collection of past input data. We use an existing interface to a
production planning and control system as evaluation case and compare
the results achieved by different instance-based and model-based
learning algorithms.

1 Introduction 



One of the main obstacles to the efficient use of natural language
interfaces is the often required high amount of manual knowledge
engineering (see (Androutsopoulos et al., 1995) for a recent survey).
This time-consuming and tedious process is often referred to as
"knowledge acquisition bottleneck". It may require extensive efforts by
experts highly experienced in linguistics as well as in the domain and
the task (Riloff and Lehnert, 1994). Therefore, natural language
interfaces represent a domain that is very well suited for the
application of machine learning algorithms to automate the acquisition
process of linguistic knowledge.

So far, inductive learning has already been applied successfully to a
large variety of natural language tasks. This includes basic linguistic
problems such as morphological analysis (van den Bosch et al., 1996),
parsing (Zelle and Mooney, 1996), word sense disambiguation (Mooney,
1996), and anaphora resolution (Aone and Bennett, 1996). Besides this,
there also exists some research on applications, e.g. machine
translation (Yamazaki et al., 1996), text categorization (Moulinier and
Ganascia, 1996), or information extraction (Soderland et al., 1996).

The learning task in natural language interfaces is to select the
correct command class based on semantic features extracted from the
user input. Therefore, it can be modeled as a classification problem,
i.e. the machine learning algorithms construct a theory from the
training data that is used for classifying unseen test data (Quinlan,
1996). So far, we consider only supervised learning so that each
training case has to be labeled with the correct class.

We apply different existing instance-based and model-based algorithms
to this problem and compare the achieved results. In addition, we have
also developed several new algorithms, which we present briefly in this
paper. We have implemented all algorithms by means of the deductive
object-oriented database system ROCK & ROLL (Barja et al., 1994).
It solves the problem of updates in deductive 
databases in that it separates the declarative logic 
query language ROLL from the imperative data ma- 
nipulation language ROCK within the context of a 
common object-oriented data model. Besides this, 
ROCK & ROLL makes a clean distinction between 
type declarations, which describe the structural char- 
acteristics of a set of instance objects and the meth- 
ods that can be applied to them, and class defini- 
tions, which specify the implementation of the meth- 
ods associated with a type. 

The use of the available powerful logic and object- 
oriented programming language enables an efficient 
implementation of the different approaches to ma- 
chine learning. It also gives us a convenient in- 
tegrated tool that assists in applying the machine 
learning algorithms to the data collection stored in 
the same database. 



Figure 1: System architecture of natural language interface 



For a comparative evaluation of the implemented algorithms, we applied
them to an extensive case study: a natural language interface for a
production planning and control system. The system is used in a
multilingual environment, which includes the languages English, German,
and Japanese. Therefore, an important issue of the evaluation was to
check whether the learned knowledge is language-independent, i.e.
whether it really operates on semantic deep forms so that it abstracts
from linguistic surface phenomena.

The rest of the paper is organized as follows. 
First, we briefly introduce the learning task before 
we present the applied machine learning algorithms 
in more detail. Finally, we explain the set-up of 
the case study and discuss the achieved results from 
evaluation. 

2 Learning Task 

Our interface architecture is displayed in Fig. 1. It represents a
multilingual database interface for the languages English, German, and
Japanese. First, the language of the user input is detected and the
input is transferred to the corresponding language-specific
morphological and lexical analyzer.

Morphological and lexical analysis performs the 
tokenization of the input, i.e. the segmentation into 
individual words or tokens. This task is not always 
trivial as in the case of Japanese, which uses no 
spaces for separating words. As next step the input 
is transformed into a deep form list (DFL), which 
indicates for each token its surface form, category, 
and semantic deep form. 

For database interfaces, unknown values con- 
tained in the input possess particular importance 
for the meaning of a command. Therefore, we treat 
those unknown values separately in the unknown 
value list (UVL) analyzer. This module checks the 
data type of unknown values and looks them up in 
the database to find out whether they represent iden- 
tifiers of existing entities. In such a case, the entity 
type is indicated in the resulting UVL, otherwise we 
use the data type instead. 

DFL and UVL represent the input to the machine learning (ML)
classifier. It assigns a ranked list of command classes to the input
sentence according to the learned classification rules. As last step
the classifications are used for generating appropriate database
commands.

Figure 2: Example of feature encoding

For the encoding of the training data we only make use of the semantic
deep forms contained in the DFL. We use English concepts as deep forms
and map them to binary features, i.e. a certain feature equals 1 if the
deep form is a member of the DFL, otherwise it equals 0. For the
elements of the UVL we apply a more detailed encoding, which maps the
number and the type to binary features. Figure 2 shows an example of
the features derived from English, German, and Japanese input sentences
for the update of the purchase price for a material.

Thus, the learning task replaces an elaborate semantic analysis of the
user input. The development of the corresponding underlying rule base
might require several man-months. The learning task represents a
realistic real-life application, which differs from many other problems
studied in machine learning research in that it consists of a large
number of features and classes. Furthermore, the command classes are
often very similar and even for human experts very difficult to
distinguish.

3 Learning Algorithms

3.1 Instance-based Learning

Instance-based approaches represent the learned knowledge simply as a
collection of training cases or instances. For that purpose they use
the same language as for the description of the training data (Quinlan,
1993a). A new case is then classified by finding the instance with the
highest similarity and using its class as prediction. Therefore,
instance-based algorithms are characterized by a very low training
effort. On the other hand, this leads to a high storage requirement
because the algorithm has to keep all training cases in memory. Besides
this, one has to compare new cases with all existing instances, which
results in a high computation cost for classification.

Different instance-based algorithms vary in how they assess the
similarity (or distance) between two instances. Two very commonly used
methods are IB1 (Aha et al., 1991) and IB1-IG (Daelemans and van den
Bosch, 1992). Whereas IB1 applies the simple approach of treating all
features as equally important, IB1-IG uses the information gain
(Quinlan, 1986) of the features as weighting function.
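The difference between the two distance measures can be illustrated by a minimal Python sketch. It is not the ROCK & ROLL implementation; it assumes that each case is stored as the set of its features with value 1, and the function names and the toy command classes are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(cases, feature):
    """Entropy reduction obtained by splitting on one binary feature."""
    labels = [cls for _, cls in cases]
    with_f = [cls for feats, cls in cases if feature in feats]
    without_f = [cls for feats, cls in cases if feature not in feats]
    remainder = sum(len(part) / len(cases) * entropy(part)
                    for part in (with_f, without_f) if part)
    return entropy(labels) - remainder

def distance(x, y, weights=None):
    """Sum over the features on which the two cases disagree."""
    mismatches = x ^ y            # symmetric difference of the feature sets
    if weights is None:           # IB1: every feature counts equally
        return len(mismatches)
    return sum(weights.get(f, 0.0) for f in mismatches)  # IB1-IG

def classify(x, cases, weights=None):
    """Predict the class of the nearest training case."""
    _, cls = min(cases, key=lambda case: distance(x, case[0], weights))
    return cls

# Toy training set (hypothetical features and classes).
cases = [({"update", "price", "material"}, "update_price"),
         ({"show", "price", "material"}, "show_price"),
         ({"show", "stock", "material"}, "show_stock")]
features = set().union(*(feats for feats, _ in cases))
ig_weights = {f: information_gain(cases, f) for f in features}
```

In this toy set the feature `material` occurs in every case, so its information gain is zero and IB1-IG ignores it, whereas IB1 still counts it as a mismatch.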



We have developed an algorithm called BIN-CAT for binary features with
class-dependent weighting and asymmetric treatment of the feature
values. The similarity between a new case X and a training case Y is
calculated according to the following formula:

$$
\begin{aligned}
SIM_{X,Y} = {} & \sum_{i=1}^{n} p(D_i, C_Y) \cdot w_i \cdot \sigma(x_i, y_i) \\
& - \sum_{i=1}^{n} p(D_i, C_Y) \cdot w_i \cdot \delta_Y(x_i, y_i) \\
& - \sum_{i=1}^{n} [1 - p(D_i, C_Y)] \cdot w_i \cdot \delta_X(x_i, y_i) .
\end{aligned}
\tag{1}
$$

In this formula, $n$ indicates the number of features, $D_i$ the set of
instances that have value 1 for feature $i$, and $C_Y$ the class of the
training case Y. The term $p(D_i, C_Y)$ then denotes the proportion of
instances in $D_i$ that belong to class $C_Y$. $\sigma(x_i, y_i)$,
$\delta_Y(x_i, y_i)$, and $\delta_X(x_i, y_i)$ are determined as
follows:

$$
\sigma(x_i, y_i) = \begin{cases} 1 & \text{if } x_i = 1 \wedge y_i = 1 \\ 0 & \text{otherwise} \end{cases}
\qquad
\delta_Y(x_i, y_i) = \begin{cases} 1 & \text{if } x_i = 0 \wedge y_i = 1 \\ 0 & \text{otherwise} \end{cases}
\qquad
\delta_X(x_i, y_i) = \begin{cases} 1 & \text{if } x_i = 1 \wedge y_i = 0 \\ 0 & \text{otherwise} \end{cases}
\tag{2}
$$

so that the second sum in (1) is rated higher for a larger number of
occurrences of the ith feature for class $C_Y$ whereas the third sum is
rated lower. This means that if the training case Y contains a certain
feature and the new case X does not, then we rate this difference the
stronger the more often the feature occurs for class $C_Y$. On the
other hand, for features appearing in the new case X but not in Y, the
opposite is true.
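The asymmetric treatment of the two kinds of mismatch may be easier to follow in code. The following Python sketch evaluates the similarity (1) for cases stored as sets of active features. It assumes that the proportions and the feature weights have already been computed and are passed in as dictionaries; these data structures and all names are illustrative assumptions, not the original implementation.

```python
def bin_cat_similarity(x, y, class_y, p, w):
    """Similarity between a new case x and a training case y.

    x, y     -- sets of features with value 1
    class_y  -- class of the training case y
    p        -- p[(feature, cls)]: proportion p(D_i, C), assumed precomputed
    w        -- w[feature]: feature weight w_i, assumed precomputed
    """
    sim = 0.0
    for f in x | y:                 # features absent from both add nothing
        share = p.get((f, class_y), 0.0)
        if f in x and f in y:       # sigma = 1: feature shared by both cases
            sim += share * w[f]
        elif f in y:                # delta_Y = 1: only in the training case
            sim -= share * w[f]
        else:                       # delta_X = 1: only in the new case
            sim -= (1.0 - share) * w[f]
    return sim

# Hypothetical proportions and weights for a class "upd".
p = {("price", "upd"): 1.0, ("material", "upd"): 0.5, ("stock", "upd"): 0.0}
w = {"price": 1.0, "material": 1.0, "stock": 1.0}
sim = bin_cat_similarity({"price", "stock"}, {"price", "material"}, "upd", p, w)
```

Here the shared feature `price` contributes +1.0, the feature `material` missing in the new case costs 0.5, and the feature `stock`, which never occurs for the class, costs the full 1.0.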

Finally, $w_i$ represents the weight of feature $i$. It is calculated
by making use of the following formula:

$$
w_i = \sum_{j} \left( 1 - 4 \cdot p(D_i, j) \cdot [1 - p(D_i, j)] \right) . \tag{3}
$$

The term under the summation symbol represents the selectivity of
feature $i$ for class $j$. It equals 1 if either all or none of the
cases have value 1 for this feature. In other words, all instances for
class $j$ then either possess or do not possess this feature, which
makes it a very discriminative characteristic. The other extreme is
that $p(D_i, j)$ equals 50%. In that case, this feature allows for no
prediction of the class and the term under the summation symbol becomes
0.
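The behavior of the selectivity term at its extremes can be checked with a few lines of Python. The sketch assumes that the term has the form 1 - 4p(1 - p), which is the form implied by the values stated above, and that the weight is the plain sum over all classes without any normalization; the function names are hypothetical.

```python
def selectivity(p):
    """Selectivity term: 1 for p = 0 or p = 1, and 0 for p = 0.5."""
    return 1.0 - 4.0 * p * (1.0 - p)

def feature_weight(proportions):
    """Sum of the selectivity over the proportions of all classes j."""
    return sum(selectivity(p) for p in proportions)

# One feature, three classes with proportions 1.0, 0.0, and 0.5.
weight = feature_weight([1.0, 0.0, 0.5])
```

The first two classes each contribute the maximum of 1, the third class contributes nothing.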

We have implemented all above-mentioned algorithms for binary features
in ROCK & ROLL in that we store the instances as objects and assign to
them the features as ordered lists sorted by the feature numbers. The
calculation of the similarity between two cases is then realized as
method invocation on the feature list. For example, Fig. 3 shows the
ROCK method to compute the distance between two feature lists according
to IB1.

Besides pure instance-based learning we have also developed an
algorithm BIN-PRO, which creates a prototype for each class. Those
prototypes are then used for the comparison with new cases. This has
the big advantage that one does not have to store all the training
instances and that the number of required comparisons for
classification is reduced to the number of existing classes. As
similarity function between a new case X and a certain class C we use
the following formula:

$$
SIM_{X,C} = \sum_{f \in X} |D_C| \cdot p(D_f, C) \cdot w_f
          - \sum_{f \notin X} p(D_f, C) \cdot w_f . \tag{4}
$$

In this formula, we give more emphasis to features $f$ that are present
in X in that we multiply them by $|D_C|$, the number of instances for
class C. However, the second sum also takes important features for
class C into account that are missing in the new case X. As weighting
function $w_f$ we use again (3). The implementation in ROCK & ROLL is
performed by creating an object for each prototype and by invoking the
associated method for computing the similarity to a new test case.
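A prototype-based classifier of this kind needs only one similarity computation per class. The Python sketch below illustrates this under two assumptions: features of class C that are missing in the new case are subtracted as a penalty, and the proportions, class sizes, and weights are available as precomputed dictionaries. All names and numbers are hypothetical.

```python
def bin_pro_similarity(x, cls, class_size, p, w):
    """Similarity between a new case x and the prototype of class cls.

    x          -- set of features present in the new case
    class_size -- class_size[cls]: number of training instances |D_C|
    p          -- p[(feature, cls)]: proportion p(D_f, C), assumed precomputed
    w          -- w[feature]: feature weight w_f, assumed precomputed
    """
    present = sum(class_size[cls] * p.get((f, cls), 0.0) * w[f]
                  for f in w if f in x)
    missing = sum(p.get((f, cls), 0.0) * w[f] for f in w if f not in x)
    return present - missing

def rank_classes(x, classes, class_size, p, w):
    """Rank all classes by decreasing similarity to the new case."""
    return sorted(classes,
                  key=lambda c: bin_pro_similarity(x, c, class_size, p, w),
                  reverse=True)

w = {"price": 1.0, "stock": 1.0, "update": 1.0}
class_size = {"upd": 10, "show": 10}
p = {("price", "upd"): 1.0, ("update", "upd"): 1.0,
     ("stock", "show"): 1.0, ("price", "show"): 0.5}
ranking = rank_classes({"price", "update"}, ["show", "upd"], class_size, p, w)
```

The ranked list can be cut after the first entry for the predicted class or after the first three entries if several candidates are needed.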

3.2 Model-based Learning 

In contrast to instance-based learning, model-based approaches
represent the learned knowledge in a theory language that is richer
than the language used for the description of the training data
(Quinlan, 1986). Such learning methods construct explicit
generalizations of training cases resulting in a large reduction of the
size of the stored knowledge base and the cost of testing new test
cases.

In our research we consider the subtypes of deci- 
sion trees and rule-based learning as well as hybrid 
approaches between them. The main difference be- 
tween the various methods for constructing decision 
trees is the selection of the feature for splitting a 
node. The following two main categories are distin- 
guished: 

• static splitting: selects the best feature for split- 
ting always on the basis of the complete collec- 
tion of instances, 

• dynamic splitting: re-evaluates the best feature 
for splitting for each node based on the current 
local set of instances. 

Static splitting requires less computational effort 
because it performs the feature ranking only once 
for the construction process. However, it entails 
overhead to keep track of already used features and 
to eliminate features that provide no proper split- 
ting of the set of instances. Besides that, dynamic 
splitting methods produce much more compact trees 
with fewer nodes, leaves, and levels. This results in 
a sharp reduction of the storage requirement as well 
as the number of comparisons during classification. 
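The two splitting strategies differ only in where the feature ranking is computed. The following Python sketch makes this explicit with a single tree builder and a `dynamic` switch. It is a simplified illustration: the ranking function `split_balance` is a toy stand-in chosen for brevity, not one of the criteria used in the paper, and the tuple-based tree representation is an assumption.

```python
def split_balance(cases, feature):
    """Toy ranking score (an assumption): size of the smaller split half."""
    with_f = sum(feature in feats for feats, _ in cases)
    return min(with_f, len(cases) - with_f)

def majority(cases):
    """Most frequent class label; ties are broken alphabetically."""
    labels = [cls for _, cls in cases]
    return max(sorted(set(labels)), key=labels.count)

def build_tree(cases, features, score, dynamic):
    """Grow a binary tree of the form (feature, yes_subtree, no_subtree)."""
    # Static splitting: one ranking based on the complete collection.
    ranking = sorted(features, key=lambda f: score(cases, f), reverse=True)

    def grow(local, candidates):
        if len({cls for _, cls in local}) == 1:
            return local[0][1]                      # pure leaf
        if dynamic:                                 # re-rank on the local set
            candidates = sorted(candidates,
                                key=lambda f: score(local, f), reverse=True)
        for i, f in enumerate(candidates):
            yes = [c for c in local if f in c[0]]
            no = [c for c in local if f not in c[0]]
            if yes and no:                          # f properly splits the set
                rest = candidates[i + 1:]
                return (f, grow(yes, rest), grow(no, rest))
        return majority(local)                      # no usable feature left

    return grow(cases, ranking)

def classify(tree, x):
    """Top-down traversal from the root to a leaf."""
    while isinstance(tree, tuple):
        feature, yes, no = tree
        tree = yes if feature in x else no
    return tree

cases = [({"a", "b"}, "c1"), ({"a"}, "c2"), ({"b"}, "c3"), (set(), "c4")]
static_tree = build_tree(cases, ["a", "b"], split_balance, dynamic=False)
dynamic_tree = build_tree(cases, ["a", "b"], split_balance, dynamic=True)
```

Features that do not split the local set are skipped and dropped for the subtree, which corresponds to the bookkeeping overhead mentioned above for static splitting.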

We have implemented decision trees for static (BS-tree) and dynamic
splitting (BD-tree) by using the weighting function (3) as ranking
scheme for the splitting criterion. In addition, we have also
implemented the IGTree algorithm (Daelemans et al., 1997), which uses
the information gain as static splitting criterion, and C4.5 (Quinlan,
1993b), which applies the information gain to dynamic splitting. The
decision trees are implemented in ROCK & ROLL by creating an object for
each node and by linking the nodes according to the tree structure. The
classification of a new case is then simply performed as top-down
traversal of the tree starting from the root. Besides this exact search
we have also implemented an approximate search method, which allows one
incorrect edge along the traversal to find a larger number of similar
cases.

Figure 3: ROCK & ROLL code segment for IB1 distance calculation

Rule-based learning represents a second large category of model-based
techniques. It aims at deriving a set of rules from the instances of
the training set. A rule is here defined as a conjunction of literals,
which, if satisfied, assigns a class to a new case. For the case of
binary features, the literals correspond to feature tests with positive
or negative sign. This means that they check whether a new case
possesses a certain feature (for positive tests) or not (for negative
tests).

Figure 4: ROCK & ROLL code segment for test of rules

The methods for deriving the rules originate from the field of
inductive logic programming (Muggleton, 1992). One of the most
prominent algorithms for rule-based learning is FOIL (Quinlan and
Cameron-Jones, 1995), which learns for each class a set of rules by
applying a separate-and-conquer strategy. The algorithm takes the
instances of a certain class as target relation. It iteratively learns
a rule and removes those instances from the target relation that are
covered by the rule. This is repeated until no instances are left in
the target relation. A rule is grown by repeated specialization, adding
literals until the rule does not cover any instances of other classes.
In other words, the algorithm tries to find rules that possess some
positive bindings, i.e. instances that belong to the target relation,
but no negative bindings for instances of other classes. Therefore, the
reason for adding a literal is to increase the relative proportion of
positive bindings.
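The separate-and-conquer loop can be sketched for the propositional case of binary features. The Python code below is a simplified illustration, not FOIL itself: FOIL operates on first-order relations, whereas here a literal is just a (feature, sign) pair and every binding corresponds to one case. The stopping rule for non-positive gain is an added safeguard and an assumption.

```python
import math

def foil_gain(p0, n0, p1, n1):
    """Information gain of a literal that changes the positive/negative
    bindings from (p0, n0) to (p1, n1)."""
    if p1 == 0:
        return float("-inf")
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

def holds(literal, case):
    """A literal (feature, sign) tests presence (True) or absence (False)."""
    feature, sign = literal
    return (feature in case) == sign

def learn_rules(positives, negatives, features):
    """Learn rules (lists of literals) until all positive cases are covered."""
    literals = [(f, s) for f in features for s in (True, False)]
    rules, remaining = [], list(positives)
    while remaining:
        rule, pos, neg = [], remaining, list(negatives)
        while neg:                       # specialize while negatives remain
            def gain(lit):
                p1 = sum(holds(lit, c) for c in pos)
                n1 = sum(holds(lit, c) for c in neg)
                return foil_gain(len(pos), len(neg), p1, n1)
            best = max(literals, key=gain)
            if gain(best) <= 0:          # safeguard: no literal helps any more
                break
            rule.append(best)
            pos = [c for c in pos if holds(best, c)]
            neg = [c for c in neg if holds(best, c)]
        rules.append(rule)
        # Separate: remove the positive cases covered by the new rule.
        remaining = [c for c in remaining
                     if not all(holds(lit, c) for lit in rule)]
    return rules

positives = [{"update", "price"}, {"update", "stock"}]
negatives = [{"show", "price"}, {"show", "stock"}]
rules = learn_rules(positives, negatives, ["update", "show", "price", "stock"])
```

For this toy data a single rule with the positive test for `update` covers both positive cases and excludes both negative ones.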

As weighting function for selecting the next literal, FOIL uses the
information gain. We have implemented FOIL, and besides this, we also
use the algorithm BIN-rules with the following weighting function:

$$
W_{f,s,C} = b_f^{+} \cdot (b^{-} - b_f^{-}) \cdot w_{f,s,C} . \tag{5}
$$

In this formula, $s$ indicates the sign of the feature test. The number
of positive (negative) bindings after adding the literal for the test
of feature $f$ is written as $b_f^{+}$ ($b_f^{-}$). Finally, $b^{-}$
indicates the number of negative bindings before adding the literal so
that $b^{-} - b_f^{-}$ calculates the reduction of negative bindings
achieved by adding the literal. The weights $w_{f,s,C}$ are calculated
as class-dependent weights for class C by making use of the feature
weights $w_f$ from (3):

$$
w_{f,s,C} = \begin{cases}
w_f \cdot p(D_f, C) & \text{if } s \text{ positive} \\
w_f \cdot [1 - p(D_f, C)] & \text{otherwise.}
\end{cases}
\tag{6}
$$

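As a small worked example of the BIN-rules weighting, the following Python sketch evaluates (5) and (6) for one candidate literal. The function names and the numbers are hypothetical and serve only to show how the three factors interact.

```python
def class_weight(w_f, p_f_c, positive):
    """Class-dependent weight w_{f,s,C} of a feature test, as in (6)."""
    return w_f * p_f_c if positive else w_f * (1.0 - p_f_c)

def literal_weight(b_pos_after, b_neg_before, b_neg_after,
                   w_f, p_f_c, positive):
    """Weight W_{f,s,C} of a candidate literal, as in (5)."""
    reduction = b_neg_before - b_neg_after   # negative bindings removed
    return b_pos_after * reduction * class_weight(w_f, p_f_c, positive)

# A literal that keeps 8 positive bindings and cuts the negative
# bindings from 10 to 2, for a feature with w_f = 0.5 and p = 0.75.
positive_test = literal_weight(8, 10, 2, 0.5, 0.75, positive=True)
negative_test = literal_weight(8, 10, 2, 0.5, 0.75, positive=False)
```

With identical binding counts, the positive test is rated three times higher than the negative one because the feature occurs in 75% of the cases counted by the proportion.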
We have implemented the test of rules as deductive ROLL method as shown
in Fig. 4. The invocation of the method is a query with the parameter
fl for the feature list of the new case. The test returns false for
those rules that are satisfied by the new case. The result of the query
can then be assigned to the set of satisfied rules rs by using the
command: rs := [{R} | ¬differ(!fl)@R];. As in the case of decision
trees, we have developed an approximate test, which tolerates one
divergent literal.
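The exact and the approximate test differ only in the number of violated literals they accept. A minimal Python sketch, assuming that a rule is stored as a class label together with a list of (feature, sign) literals, could look as follows; it is an illustration and not a translation of the ROLL method.

```python
def divergent_literals(rule, case):
    """Count the literals of a rule that the case violates."""
    return sum((feature in case) != sign for feature, sign in rule)

def satisfied_rules(rules, case, tolerance=0):
    """Classes of all rules violated by at most `tolerance` literals."""
    return [cls for cls, rule in rules
            if divergent_literals(rule, case) <= tolerance]

# Hypothetical rule base with one rule per command class.
rules = [("update_price", [("update", True), ("price", True)]),
         ("show_price", [("show", True), ("price", True)]),
         ("show_stock", [("show", True), ("stock", True)])]
case = {"update", "price"}
exact = satisfied_rules(rules, case)
approximate = satisfied_rules(rules, case, tolerance=1)
```

The approximate test returns a superset of the exact matches, which is what makes it usable for producing additional candidate classes.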

As last group of model-based algorithms we look 
at hybrid approaches between decision trees and rule- 
based learning. There exist two ways in principle 
to combine the advantages of the two paradigms. 
The first one is to extract rules from a decision tree 
whereas the second one follows the opposite direc- 
tion by constructing a decision tree from a rule base. 

As example of the first type of approach we have implemented
C4.5-RULES (Quinlan, 1993b), which extracts rules from the decision
tree built by C4.5. Rules are computed as paths along the traversal
from the root to all leaves. In a second run, rules are pruned by
removing redundant literals and rules.

Regarding the second type of approach, we start 
from the rule base produced by BIN-rules and use 
it for building an SE-tree (Rymon, 1993). SE-trees
are a generalization of decision trees in that they 
allow not only one but several feature tests at one 
node. Therefore, a much flatter and more compact 
tree structure is achieved. For the construction of 
the tree we sort the feature tests of the rules first. 
Starting from a root node, we then construct paths 
according to the literals of the individual rules. For 
this process we make use of existing paths as far as 
possible before creating new branches. 
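The reuse of existing paths during construction resembles the insertion of sorted keys into a prefix tree. The Python sketch below follows that reading: every rule is inserted as a path of sorted literals, and rules with a common prefix share nodes. It is a simplified stand-in under this assumption and does not claim to reproduce the SE-tree definition cited above in full; the dictionary-based node layout is hypothetical.

```python
def build_se_tree(rules):
    """Insert each rule as a path of sorted literals, sharing prefixes."""
    root = {"children": {}, "classes": []}
    for cls, rule in rules:
        node = root
        for literal in sorted(rule):          # fixed order of feature tests
            node = node["children"].setdefault(
                literal, {"children": {}, "classes": []})
        node["classes"].append(cls)           # the rule ends at this node
    return root

def classify(node, case):
    """Collect the classes of all rules whose literals the case satisfies."""
    found = list(node["classes"])
    for (feature, sign), child in node["children"].items():
        if (feature in case) == sign:         # several tests per node
            found.extend(classify(child, case))
    return found

rules = [("update_price", [("update", True), ("price", True)]),
         ("show_price", [("show", True), ("price", True)]),
         ("show_stock", [("stock", True), ("show", True)])]
tree = build_se_tree(rules)
```

The first two rules share the node for the positive test of `price`, so the root has only two outgoing branches for three rules.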

4 Evaluation 

As case study for investigating the feasibility of the implemented
machine learning algorithms, we use a multilingual natural language
interface to a production planning and control system (PPC). The PPC
performs the medium-term scheduling of products and resources involved
in the manufacturing processes, i.e. material, machines, and labor. The
resulting master production schedule forms the basis of the
coordination of related business services such as engineering,
manufacturing, and finance. The modeled enterprise makes precision
tools by using job order production and serial manufacture as basic
strategies. The efficient realization of the high demands of the
application exceeds the power of relational database technology.
Therefore, it represents an excellent choice for taking full advantage
of the extended functionality of deductive object-oriented database
systems. Furthermore, the sophisticated functionality justifies the
effective use of a natural language interface.



During previous research (Winiwarter, 1994) we developed a German
natural language interface based on 1000 input sentences that had been
collected from users by means of questionnaires. The input sentences
were then mapped to 100 command classes (10 sentences for each class).
The mapping was performed by elaborate semantic analysis; for the
development of the underlying rule base we spent several man-months.

Therefore, we were eager to see if we could replace 
this extensive effort by a machine learning compo- 
nent that learns the same linguistic knowledge auto- 
matically. For this purpose we divided the 1000 sen- 
tences into 900 training cases and 100 test cases. In 
addition, we collected 100 Japanese and 100 English 
test sentences to check whether the learned knowl- 
edge really operates at a semantic level independent 
from language-specific phenomena. 

As result of the encoding of the training set (see Sect. 2) we obtained
the large number of 316 features, 289 for the DFL and 27 for the UVL.
For the evaluation of the different machine learning algorithms we used
as performance measures the success rate, i.e. the proportion of
correctly classified test cases, and the top-3 rate. The latter
indicates the proportion of cases where the correct classification is
among the first three predicted classes. For the case of model-based
approaches we had to produce additional candidates for classes. This
was achieved by applying approximate methods that allow one incorrect
edge along the traversal of decision trees or one divergent literal for
the test of rules (see Sect. 3.2).
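Both performance measures can be computed from the ranked lists of predicted classes. The following Python sketch assumes that each test case yields a list of classes in decreasing order of preference; the function names and the toy data are hypothetical.

```python
def top_k_rate(rankings, gold, k=3):
    """Proportion of cases whose correct class is among the first k."""
    hits = sum(g in ranking[:k] for ranking, g in zip(rankings, gold))
    return hits / len(gold)

def success_rate(rankings, gold):
    """Proportion of cases whose top-ranked class is the correct one."""
    return top_k_rate(rankings, gold, k=1)

# Four toy test cases whose correct class is always "a".
rankings = [["a", "b", "c"], ["b", "a", "c"],
            ["c", "b", "a", "d"], ["d", "c", "b", "a"]]
gold = ["a", "a", "a", "a"]
```

For these four cases the success rate is 25% and the top-3 rate is 75%, since the correct class is ranked first once and within the first three positions three times.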

Our first experiment was the comparison of the four instance-based
algorithms IB1, IB1-IG, BIN-CAT, and BIN-PRO. As can be seen from the
results in Table 1, BIN-CAT clearly outperforms IB1 and IB1-IG.
Concerning the method BIN-PRO, which uses prototypes of classes, we
achieved results at the same quality level as for BIN-CAT. This is
remarkable if one considers the much more condensed representation of
the learned knowledge.

          GERMAN                ENGLISH               JAPANESE
          Success    Top-3      Success    Top-3      Success    Top-3
          rate       rate       rate       rate       rate       rate
IB1       82%        94%        98%        99%        94%        98%
IB1-IG    84%        98%        97%        100%       90%        99%
BIN-CAT   94%        100%       99%        100%       99%        100%
BIN-PRO   95%        100%       97%        100%       97%        100%

Table 1: Test results for instance-based learning

The comparison between the results for the individual languages shows
that there is no advantage for the German test sentences. On the
contrary, the test results for German are inferior to those for English
or Japanese. This may be partly due to a greater deviation of the
German expressions and phrases used in the test set from the ones used
in the training set. Besides this, the restriction of extracted
features during encoding the test set to those learned from the
training set certainly performs an important filtering function. It
removes language-specific syntactic particles that do not contribute to
the meaning of the input. This is especially true for the case of
Japanese sentences, which possess a completely different syntactic
structure in comparison with English or German including many particles
with no equivalent words in the other two languages.

          GERMAN                ENGLISH               JAPANESE
          Success    Top-3      Success    Top-3      Success    Top-3
          rate       rate       rate       rate       rate       rate
IGTree    80%        94%        92%        100%       86%        97%
BS-tree   86%        97%        95%        100%       90%        96%
C4.5      94%        100%       94%        100%       89%        100%
BD-tree   93%        99%        94%        99%        91%        99%
SE-tree   94%        97%        96%        97%        91%        95%

Table 2: Test results for decision trees

The second part of the evaluation was the comparison of the four
algorithms for building decision trees: IGTree, BS-tree, C4.5, and
BD-tree. Besides this, we also included the SE-tree constructed by a
hybrid approach (see Sect. 3.2). The test results in Table 2 indicate
that the trees with dynamic splitting are superior to those with static
splitting and that C4.5, BD-tree, and SE-tree produce results of
similar quality. Table 3 compares the number of nodes, leaves, and
levels for the individual trees. The two trees with dynamic splitting
are much more compact than those with static splitting, with C4.5
clearly outperforming BD-tree. Finally, the hybrid SE-tree is much
flatter than C4.5 but possesses a larger number of nodes and leaves.





          Nodes    Leaves    Levels
IGTree    865      433       33
BS-tree   719      360       86
C4.5      339      170       26
BD-tree   451      226       52
SE-tree   559      209       8

Table 3: Characteristics for decision trees

As last part of our comparative study we tested the rule-based
techniques FOIL, BIN-rules, and the hybrid approach C4.5-RULES. As
Table 4 shows, FOIL produces the most compact representation of learned
knowledge, followed by C4.5-RULES and BIN-rules. However, according to
Table 5 both BIN-rules and C4.5-RULES outperform FOIL with almost
identical results.





             Rules    Literals    Max. length
FOIL         215      534         5
BIN-rules    209      726         7
C4.5-RULES   167      677         24

Table 4: Characteristics for rule-based learning



An advantage of rule-based learning in compari- 
son with other methods is that the learned knowl- 
edge can be easily presented to the user in a clear 
and understandable form. The derived rules allow a 
transparent knowledge representation that one can 
use for explaining decisions of the system to the user. 
Figure 5 gives some examples of rule sets learned by
BIN-rules for several command classes. 

             GERMAN                ENGLISH               JAPANESE
             Success    Top-3      Success    Top-3      Success    Top-3
             rate       rate       rate       rate       rate       rate
FOIL         85%        97%        92%        97%        88%        96%
BIN-rules    94%        97%        95%        97%        91%        95%
C4.5-RULES   94%        98%        94%        96%        91%        96%

Table 5: Test results for rule-based learning

Figure 5: Examples of learned rules

If we take a final look at Table 1, Table 2, and Table 5, we can see
that independent of the applied machine learning paradigm the achieved
results reached satisfactory quality for all three groups. By
considering the three best representatives BIN-CAT, C4.5, and
BIN-rules, we obtain an average success rate for all three languages of
94.3% and a top-3 rate of 98.8%. This result is surprisingly high if
one considers the complexity of the task at hand. Unfortunately, we had
no possibility of a direct comparison with the results of the
hand-engineered interface because the previous interface had been
developed only for German based on the complete collection of 1000
sentences by using different software. In any case, we could show that
machine learning represents a sound alternative to manual knowledge
acquisition for the application in natural language interfaces.
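The two averages quoted for the best representatives follow directly from the German, English, and Japanese values reported for BIN-CAT, C4.5, and BIN-rules in the result tables. The short Python check below simply recomputes them; the dictionary layout is only a convenience.

```python
# Success rates and top-3 rates in percent (German, English, Japanese).
success = {"BIN-CAT": (94, 99, 99),
           "C4.5": (94, 94, 89),
           "BIN-rules": (94, 95, 91)}
top3 = {"BIN-CAT": (100, 100, 100),
        "C4.5": (100, 100, 100),
        "BIN-rules": (97, 97, 95)}

def mean_rate(table):
    """Average over all algorithms and all three languages."""
    values = [v for row in table.values() for v in row]
    return sum(values) / len(values)

avg_success = round(mean_rate(success), 1)   # 849 / 9
avg_top3 = round(mean_rate(top3), 1)         # 889 / 9
```

The nine success rates sum to 849 and the nine top-3 rates to 889, which gives the rounded averages of 94.3% and 98.8%.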

5 Conclusion 

In this paper we have presented first results from a comparative study
of applying different inductive learning techniques to natural language
interfaces. We have implemented a representative selection of
instance-based and model-based algorithms by making use of deductive
object-oriented database functionality. The extensive case study for an
interface to a production planning and control system shows the
feasibility of the approach: the algorithms learn linguistic knowledge
whose acquisition normally requires a large effort by human experts.

Future work will concentrate on the important point of increasing the
reliability of test results in that we apply cross-validation trials
and statistical tests for the significance of performance differences
between two algorithms. Furthermore, we also want to generate learning
curves that plot success rates as a function of the size of the
training collection. Besides this, we plan to test our learning
algorithms on standard benchmark machine learning datasets and other
typical natural language learning datasets.

Finally, we intend to extend the implemented al- 
gorithms to include also unsupervised methods as 
well as connectionist and evolutionary techniques. 
In addition, we will implement incremental learning 
techniques, which continue the learning process dur- 
ing the test phase, and adaptive boosting methods, 
which apply several classifiers instead of just one. 
We believe that our study is a first promising step 
towards the challenging task of carrying out compar- 
ative evaluations of the performance of different ma- 
chine learning algorithms for specific linguistic prob- 
lems. 